This is a capstone project completed as part of the Google Data Analytics Professional Certificate course. The project uses the R programming language and the RStudio IDE to analyse a dataset. It follows six steps: Ask, Prepare, Process, Analyse, Share, and Act. These steps cover defining the problem or question to be answered, preparing and cleaning the data, analysing the data statistically and through visualisations, sharing the insights obtained from the analysis, and taking action based on those insights.
3. Process
The libraries and dataset are loaded into the RStudio environment, and the data is prepared for analysis through the necessary cleaning, manipulation, and transformation.
Loading necessary packages
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/arpit/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\arpit\AppData\Local\Temp\Rtmp6veo0u\downloaded_packages
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# lubridate, dplyr, readr, ggplot2 and tidyr are core tidyverse packages and were
# attached by library(tidyverse) above, so they need no separate install.packages()
# or library() calls.
install.packages("janitor", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/arpit/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'janitor' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\arpit\AppData\Local\Temp\Rtmp6veo0u\downloaded_packages
library(janitor)
## Warning: package 'janitor' was built under R version 4.2.3
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
install.packages("ggmap", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/arpit/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'ggmap' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\arpit\AppData\Local\Temp\Rtmp6veo0u\downloaded_packages
library(ggmap)
## Warning: package 'ggmap' was built under R version 4.2.3
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
install.packages("geosphere", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/arpit/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'geosphere' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\arpit\AppData\Local\Temp\Rtmp6veo0u\downloaded_packages
library(geosphere)
## Warning: package 'geosphere' was built under R version 4.2.3
install.packages("modeest", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/arpit/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'modeest' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\arpit\AppData\Local\Temp\Rtmp6veo0u\downloaded_packages
library("modeest")
## Warning: package 'modeest' was built under R version 4.2.3
## Registered S3 method overwritten by 'rmutil':
## method from
## print.response httr
Loading the data
Jan_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202201-divvy-tripdata.csv")
Feb_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202202-divvy-tripdata.csv")
Mar_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202203-divvy-tripdata.csv")
Apr_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202204-divvy-tripdata.csv")
May_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202205-divvy-tripdata.csv")
Jun_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202206-divvy-tripdata.csv")
Jul_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202207-divvy-tripdata.csv")
Aug_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202208-divvy-tripdata.csv")
Sep_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202209-divvy-tripdata.csv")
Oct_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202210-divvy-tripdata.csv")
Nov_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202211-divvy-tripdata.csv")
Dec_2022 <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Cyclistic Bike Share Case Study/Dataset/202212-divvy-tripdata.csv")
Combining the data into one dataframe
df <- rbind(Jan_2022, Feb_2022, Mar_2022, Apr_2022, May_2022, Jun_2022,
Jul_2022, Aug_2022, Sep_2022, Oct_2022, Nov_2022, Dec_2022)
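The twelve read.csv() calls and the rbind() above could also be written as a loop over the files in one folder. The sketch below is one possible way to do that. It writes three tiny stand-in CSV files to a temporary folder and then reads and stacks them. The folder, the file contents, and the names `data_dir`, `monthly_files` and `trips` are illustrative assumptions, not part of the original analysis.

```r
# Stand-in data: three tiny monthly files in a temporary folder (hypothetical content)
data_dir <- file.path(tempdir(), "divvy_demo")
dir.create(data_dir, showWarnings = FALSE)
for (m in c("202201", "202202", "202203")) {
  demo <- data.frame(ride_id = paste0(m, "_", 1:2),
                     member_casual = c("member", "casual"))
  write.csv(demo, file.path(data_dir, paste0(m, "-divvy-tripdata.csv")),
            row.names = FALSE)
}

# List every monthly file, read each one, and stack the results row-wise
monthly_files <- list.files(data_dir, pattern = "-divvy-tripdata\\.csv$",
                            full.names = TRUE)
trips <- do.call(rbind, lapply(monthly_files, read.csv))

nrow(trips)  # 2 rows per file x 3 files
```

With the real data, `data_dir` would point at the folder holding the twelve monthly files, and the stand-in loop would be dropped.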
Viewing the first few rows of the data
head(df)
Viewing the structure of the data
str(df)
## 'data.frame': 5667717 obs. of 13 variables:
## $ ride_id : chr "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ...
## $ ended_at : chr "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ...
## $ start_station_name: chr "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ start_lat : num 42 42 41.9 42 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num 42 42 41.9 42 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr "casual" "casual" "member" "casual" ...
Cleaning the data
Checking the character columns for missing (NA) values and replacing them with the most frequent value (mode) of each column
cols_to_check <- c("ride_id", "rideable_type", "started_at", "ended_at", "start_station_name",
                   "start_station_id", "end_station_name", "end_station_id", "member_casual")
for (col in cols_to_check) {
  # Most frequent non-missing value in the column
  mode_val <- names(sort(table(df[[col]], exclude = NA), decreasing = TRUE))[1]
  df[[col]][is.na(df[[col]])] <- mode_val
}
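To make the mode replacement concrete, here is a minimal sketch on a made-up vector. The helper name `mode_value` and the sample values are assumptions for illustration only. It also shows that only true NA values are touched, while empty strings are left as they are.

```r
# Return the most frequent non-missing value of a vector
mode_value <- function(x) {
  names(sort(table(x, exclude = NA), decreasing = TRUE))[1]
}

bike_type <- c("classic_bike", NA, "electric_bike", "classic_bike", "")

# Only true NA values are replaced; the empty string "" is left untouched
bike_type[is.na(bike_type)] <- mode_value(bike_type)
bike_type
```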
Checking the numeric coordinate columns for missing (NA) values and replacing them with the most frequent value (mode), as returned by modeest::mfv()
cols_to_check <- c("start_lat", "start_lng", "end_lat", "end_lng")
for (col in cols_to_check) {
  if (sum(is.na(df[[col]])) > 0) {
    df[[col]] <- replace_na(df[[col]], modeest::mfv(df[[col]]))
  }
}
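Before replacing anything, it can help to count how many values are actually missing. The sketch below, on a made-up data frame named `demo`, counts NA values and empty strings separately, because read.csv() keeps blank text fields as "" rather than NA by default.

```r
demo <- data.frame(
  start_station_name = c("Clark St & Touhy Ave", "", NA, ""),
  start_lat = c(42.0, NA, 41.9, 41.9),
  stringsAsFactors = FALSE
)

# Count true NA values per column
na_counts <- colSums(is.na(demo))

# Count empty strings per column
empty_counts <- colSums(demo == "", na.rm = TRUE)

na_counts
empty_counts
```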
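Handling empty strings in the start_station_name variable by converting them to NA and filling them from other rides with the same start coordinates
# Note: the result of this pipeline is only printed, not assigned back to df,
# so df itself still contains the empty strings afterwards.
df %>%
  group_by(start_lat, start_lng) %>%
  mutate(start_station_name = na_if(start_station_name, "")) %>%
  fill(start_station_name)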
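Handling empty strings in the end_station_name variable by converting them to NA and filling them from other rides with the same end coordinates
# Note: as above, the result is only printed, not assigned back to df.
df %>%
  group_by(end_lat, end_lng) %>%
  mutate(end_station_name = na_if(end_station_name, "")) %>%
  fill(end_station_name)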
head(df)
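The group-wise fill used above can be illustrated on a small made-up tibble. In this sketch the result is assigned to a new object (`demo_filled`) so that the filled names are kept. The `.direction = "downup"` argument is an assumption that differs from the default `"down"` used above: it also fills a missing name that appears before the first known name in its group.

```r
library(dplyr)
library(tidyr)

demo <- tibble(
  start_lat = c(41.9, 41.9, 42.0, 42.0),
  start_lng = c(-87.6, -87.6, -87.7, -87.7),
  start_station_name = c("Station A", "", "", "Station B")
)

demo_filled <- demo %>%
  group_by(start_lat, start_lng) %>%
  mutate(start_station_name = na_if(start_station_name, "")) %>%
  fill(start_station_name, .direction = "downup") %>%  # fill both ways within a group
  ungroup()

demo_filled$start_station_name
```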
Additional data cleaning: creating new columns by extracting data from existing columns and converting them to the correct format
df_cleaned <- df %>%
  mutate(start_time = ymd_hms(started_at),  # parse the timestamps
         end_time = ymd_hms(ended_at),
         start_hour = hour(start_time),
         end_hour = hour(end_time),
         start_day = wday(start_time, label = TRUE),
         end_day = wday(end_time, label = TRUE),
         start_month = month(start_time, label = TRUE),
         end_month = month(end_time, label = TRUE),
         ride_length = as.numeric(difftime(end_time, start_time, units = "mins")),  # minutes
         ride_length_bucket = case_when(ride_length < 10 ~ "< 10",
                                        ride_length < 20 ~ "10-20",
                                        ride_length < 30 ~ "20-30",
                                        ride_length < 45 ~ "30-45",
                                        ride_length < 60 ~ "45-60",
                                        ride_length < 90 ~ "60-90",
                                        TRUE ~ "90+"),
         member_casual = case_when(member_casual == "member" ~ "Annual Member",
                                   member_casual == "casual" ~ "Casual Rider"))
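The same bucket boundaries can be checked on a few made-up ride lengths. The sketch below uses base R's cut() as an alternative to case_when(), which is a swapped-in technique rather than the method used above. With `right = FALSE` each interval includes its lower bound and excludes its upper bound, matching the "<" comparisons.

```r
ride_length <- c(-2, 5, 10, 19.9, 44, 60, 120)  # minutes (made-up values)

ride_length_bucket <- cut(
  ride_length,
  breaks = c(-Inf, 10, 20, 30, 45, 60, 90, Inf),
  labels = c("< 10", "10-20", "20-30", "30-45", "45-60", "60-90", "90+"),
  right = FALSE  # intervals are [lower, upper)
)

data.frame(ride_length, ride_length_bucket)
```

The result is a factor whose levels are stored in bucket order.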
Making ride_length and ride_id consistent: some ride lengths are zero or negative, meaning the recorded start time is at or after the end time, so these rides are filtered out. Rides with duplicate ride IDs are also filtered out.
df_cleaned <- filter(df_cleaned, ride_length > 0 & !duplicated(ride_id))
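The effect of this filter can be seen on a small made-up data frame. The sketch below uses base R subsetting with the same two conditions. The `rides` data and the `keep` vector are illustrative assumptions.

```r
rides <- data.frame(
  ride_id = c("A1", "B2", "B2", "C3", "D4"),
  ride_length = c(12.5, 8.0, 8.0, -3.2, 0),
  stringsAsFactors = FALSE
)

# Keep rides with a positive duration whose ride_id has not appeared before
keep <- rides$ride_length > 0 & !duplicated(rides$ride_id)
rides_clean <- rides[keep, ]

rides_clean$ride_id
```

Only the first "B2" row survives, and the rides with negative or zero length are dropped.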
Adding ride distance in km using latitude and longitude data
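# distGeo() returns the distance in metres between the start and end coordinates
df_cleaned$ride_distance <- distGeo(cbind(df_cleaned$start_lng, df_cleaned$start_lat),
                                    cbind(df_cleaned$end_lng, df_cleaned$end_lat))
# Drops rides that start and end at the same coordinates, and rides with missing coordinates
df_cleaned <- df_cleaned %>% filter(ride_distance > 0)
df_cleaned$ride_distance <- df_cleaned$ride_distance / 1000  # distance in km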
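For intuition about how a distance is derived from two coordinate pairs, the sketch below implements the haversine formula in base R. This is a simpler, swapped-in technique: it assumes a spherical Earth with a radius of 6371 km, whereas distGeo() works on an ellipsoid, so the two will differ slightly. The function name `haversine_km` and the sample points are assumptions for illustration.

```r
# Great-circle distance in km between two points given in decimal degrees.
# Assumes a spherical Earth with radius 6371 km (an approximation).
haversine_km <- function(lng1, lat1, lng2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# One degree of longitude along the equator
haversine_km(0, 0, 1, 0)

# Two made-up points in Chicago
haversine_km(-87.66, 41.90, -87.63, 41.93)
```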
Dropping the variables not required in this analysis
df_cleaned <- subset(df_cleaned, select = -c(started_at,ended_at,start_station_id,end_station_id))
head(df_cleaned)
4. Analyse
Descriptive Summary of different variables
summary(df_cleaned)
## ride_id rideable_type start_station_name end_station_name
## Length:5348723 Length:5348723 Length:5348723 Length:5348723
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.84 Min. : 0.00 Min. :-88.14
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80 Max. :42.37 Max. : 0.00
##
## member_casual start_time
## Length:5348723 Min. :2022-01-01 00:01:00.00
## Class :character 1st Qu.:2022-05-29 11:45:48.50
## Mode :character Median :2022-07-23 01:54:40.00
## Mean :2022-07-20 19:43:53.51
## 3rd Qu.:2022-09-16 16:54:57.00
## Max. :2022-12-31 23:59:26.00
##
## end_time start_hour end_hour start_day
## Min. :2022-01-01 00:04:02.00 Min. : 0.00 Min. : 0.00 Sun:722765
## 1st Qu.:2022-05-29 12:10:33.50 1st Qu.:11.00 1st Qu.:11.00 Mon:707490
## Median :2022-07-23 02:08:13.00 Median :15.00 Median :15.00 Tue:743197
## Mean :2022-07-20 19:59:50.00 Mean :14.21 Mean :14.35 Wed:759078
## 3rd Qu.:2022-09-16 17:10:59.00 3rd Qu.:18.00 3rd Qu.:18.00 Thu:798962
## Max. :2023-01-01 18:09:37.00 Max. :23.00 Max. :23.00 Fri:758364
## Sat:858867
## end_day start_month end_month ride_length
## Sun:727274 Jul : 776028 Jul : 776091 Min. : 0.02
## Mon:708238 Aug : 743583 Aug : 743618 1st Qu.: 6.03
## Tue:743062 Jun : 723908 Jun : 723821 Median : 10.35
## Wed:758801 Sep : 666184 Sep : 666142 Mean : 15.94
## Thu:798252 May : 592681 May : 592686 3rd Qu.: 18.15
## Fri:756025 Oct : 532212 Oct : 532275 Max. :34354.07
## Sat:857071 (Other):1314127 (Other):1314090
## ride_length_bucket ride_distance
## Length:5348723 Min. : 0.000
## Class :character 1st Qu.: 1.002
## Mode :character Median : 1.659
## Mean : 2.265
## 3rd Qu.: 2.885
## Max. :9817.319
##
Creating a variable for summary statistics by user type
ride_summary <- df_cleaned %>%
  group_by(member_casual) %>%
  summarize(total_rides = n(),
            percentage_rides = (n() / nrow(df_cleaned)) * 100,
            avg_ride_length = mean(ride_length),
            avg_ride_distance = mean(ride_distance))
ride_summary
- Annual Members take more rides than Casual Riders and account for almost 60% of all rides.
- On average, Casual Riders use bikes for longer than Annual Members.
- Average Ride Distance is approximately the same for both types of users.
Creating a variable for ride summary by user type for different months of the Year 2022
ride_month_summary <- df_cleaned %>%
  group_by(member_casual, start_month) %>%
  summarize(total_rides = n(),
            avg_ride_length = mean(ride_length),
            avg_ride_distance = mean(ride_distance), .groups = 'drop')
ride_month_summary
- Both types of users took the most rides between May and October and the fewest between December and February.
- Average Ride Distance stayed between 1.8 and 2.5 Km for both types of users.
- Average Ride Length was between 10.4 and 13.6 Minutes for Annual Members, and between 13.3 and 24.6 Minutes for Casual Riders.
Creating a variable for ride summary by user type for different days of the week
ride_day_summary <- df_cleaned %>%
  group_by(member_casual, start_day) %>%
  summarize(total_rides = n(),
            avg_ride_length = mean(ride_length),
            avg_ride_distance = mean(ride_distance), .groups = 'drop')
ride_day_summary
- Annual Members took the most rides on weekdays and the fewest on weekends.
- In contrast, Casual Riders took the most rides on weekends and the fewest on weekdays.
- Average Ride Length was higher on weekends than on weekdays for both types of users.
- There is not much variation in the Average Ride Distance between the two user types.
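The weekday versus weekend comparison can also be made explicit by labelling each ride with a day type. The sketch below is one way to do this in base R on made-up timestamps. It uses the "%u" format code (ISO weekday number, 1 = Monday to 7 = Sunday) so that the result does not depend on the locale's day names. The `day_type` labels and the sample values are assumptions.

```r
start_time <- as.POSIXct(c("2022-01-01 10:00:00",   # Saturday
                           "2022-01-02 11:30:00",   # Sunday
                           "2022-01-03 08:15:00",   # Monday
                           "2022-01-07 17:45:00"),  # Friday
                         tz = "UTC")
ride_length <- c(25, 30, 9, 11)  # minutes (made-up values)

# "%u" gives the ISO weekday number: 1 = Monday ... 7 = Sunday
iso_day <- as.integer(format(start_time, "%u"))
day_type <- ifelse(iso_day >= 6, "Weekend", "Weekday")

# Number of rides and average ride length for each day type
table(day_type)
tapply(ride_length, day_type, mean)
```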
Creating a variable for ride summary by user type for different hours in a day
ride_hour_summary <- df_cleaned %>%
  group_by(member_casual, start_hour) %>%
  summarize(total_rides = n(),
            avg_ride_length = mean(ride_length),
            avg_ride_distance = mean(ride_distance), .groups = 'drop')
ride_hour_summary
- Annual Members took the most rides between 7am and 9pm.
- Casual Riders took the most rides between 11am and 8pm.
Creating a variable summarising by type of ride and user type
ride_type_summary <- df_cleaned %>%
  group_by(rideable_type, member_casual) %>%
  summarize(total_rides = n(),
            percentage_rides = (n() / nrow(df_cleaned)) * 100,
            avg_ride_length = mean(ride_length),
            avg_ride_distance = mean(ride_distance), .groups = 'drop')
ride_type_summary
- Annual Members mostly used Classic Bikes, which accounted for almost 31% of all rides.
- Casual Riders mostly used Electric Bikes, which accounted for almost 29% of all rides.
- Docked Bikes were used only by Casual Riders, and only for a very small share of rides (2.7%).
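The percentages above are shares of all rides, because percentage_rides divides by nrow(df_cleaned). To see the difference between a share of all rides and a share within each user type, here is a minimal base R sketch on made-up data. The vectors and the names `share_of_all` and `share_within_user` are assumptions for illustration.

```r
rideable_type <- c("classic_bike", "classic_bike", "electric_bike",
                   "electric_bike", "electric_bike", "docked_bike")
member_casual <- c("Annual Member", "Annual Member", "Annual Member",
                   "Casual Rider", "Casual Rider", "Casual Rider")

counts <- table(rideable_type, member_casual)

# Share of all rides (what percentage_rides in the summary measures)
share_of_all <- round(100 * prop.table(counts), 1)

# Share within each user type (each column sums to about 100)
share_within_user <- round(100 * prop.table(counts, margin = 2), 1)

share_of_all
share_within_user
```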
Creating a variable summarising by type of ride for different months of the Year 2022
ride_type_month_summary <- df_cleaned %>%
  group_by(rideable_type, start_month) %>%
  summarize(total_rides = n(),
            avg_ride_length = mean(ride_length),
            avg_ride_distance = mean(ride_distance), .groups = 'drop')
ride_type_month_summary
- All three Types of Bikes were used the most between May-October and least between December-February.
- Average Ride Length for Docked Bikes is much higher than that of Classic and Electric Bikes.
Creating a variable summarising by user type and ride bucket
ride_bucket_summary <- df_cleaned %>%
  group_by(member_casual, ride_length_bucket) %>%
  summarize(total_rides = n(),
            percentage_rides = (n() / nrow(df_cleaned)) * 100,
            avg_ride_length = mean(ride_length),
            avg_ride_distance = mean(ride_distance), .groups = 'drop')
ride_bucket_summary
- Rides of less than 10 minutes by Annual Members made up 33% of all rides, and those by Casual Riders made up 15%.
- For both types of users, rides lasting more than 90 minutes were the least common.
Creating a variable summarising by number of rides for different start stations
df_cleaned$start_station_name[df_cleaned$start_station_name == ""] <- NA
ride_start_stations <- df_cleaned %>%
  drop_na(start_station_name) %>%
  group_by(start_station_name) %>%
  summarize(total_rides = n()) %>%
  arrange(desc(total_rides))
ride_start_stations
- The most rides started from the Streeter Dr & Grand Ave station.
- The fewest rides started from the Komensky Ave & 59th St station.
Creating a variable summarising by number of rides for different end stations
df_cleaned$end_station_name[df_cleaned$end_station_name == ""] <- NA
ride_end_stations <- df_cleaned %>%
  drop_na(end_station_name) %>%
  group_by(end_station_name) %>%
  summarize(total_rides = n()) %>%
  arrange(desc(total_rides))
ride_end_stations
- The most rides ended at the Streeter Dr & Grand Ave station.
- The fewest rides ended at the Altgeld Gardens station.
Creating a new dataframe with the start station, end station, and number of rides between each pair
station_pairs <- df_cleaned %>%
  drop_na(start_station_name, end_station_name) %>%
  group_by(start_station_name, end_station_name) %>%
  summarize(total_rides = n(),
            avg_ride_length = mean(ride_length),
            avg_ride_distance = mean(ride_distance), .groups = 'drop') %>%
  filter(total_rides >= 50) %>%  # keep only pairs with at least 50 rides
  arrange(desc(total_rides))
station_pairs
- The most rides took place between the Ellis Ave & 60th St station and the University Ave & 57th St station.
- Among station pairs with at least 50 rides, the fewest rides took place between the Clark St & Armitage Ave station and the Lincoln Ave & Fullerton Ave station.
5. Share
Creating a bar chart for number of rides by user type
ggplot(ride_summary, aes(x = member_casual, y = total_rides, fill = member_casual)) +
  geom_col() +
  labs(title = "Number of Rides by User Type",
       x = "User Type",
       y = "Number of Rides")

- Annual Members took more rides than the Casual Riders.
Creating a pie chart for Percentage of Rides by user type
# Creating a basic stacked bar
pie <- ggplot(ride_summary, aes(x = "", y = percentage_rides, fill = member_casual)) +
  geom_bar(stat = "identity", width = 1)
# Converting to a pie chart (polar coordinates)
pie <- pie + coord_polar("y", start = 0)
# Removing the axis labels and adding a title
pie <- pie + labs(x = NULL, y = NULL, fill = NULL, title = "Percentage of Rides by User Type")
# Tidying up the theme
pie <- pie + theme_classic() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5))
pie

- Annual Members accounted for more than 50% of the rides.
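A pie chart is easier to read when each slice shows its percentage. The sketch below is one possible way to add such labels with geom_text() and position_stack(). It uses made-up percentages in a stand-in data frame (`demo_summary`), and theme_void() is used here as a shortcut for removing the axes, which is an assumption rather than the theme used above.

```r
library(ggplot2)

# Made-up percentages standing in for ride_summary
demo_summary <- data.frame(
  member_casual = c("Annual Member", "Casual Rider"),
  percentage_rides = c(60, 40)
)

pie_labelled <- ggplot(demo_summary,
                       aes(x = "", y = percentage_rides, fill = member_casual)) +
  geom_col(width = 1) +
  # Place each label in the middle of its slice
  geom_text(aes(label = paste0(percentage_rides, "%")),
            position = position_stack(vjust = 0.5)) +
  coord_polar("y", start = 0) +
  labs(x = NULL, y = NULL, fill = NULL,
       title = "Percentage of Rides by User Type") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5))

pie_labelled
```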
Creating a bar chart for average ride length by user type
ggplot(ride_summary, aes(x = member_casual, y = avg_ride_length, fill = member_casual)) +
  geom_col() +
  labs(title = "Average Ride Length by User Type",
       x = "User Type",
       y = "Average Ride Length (Minutes)")

- Average Ride Length for Casual Riders is much higher than that of Annual Members.
Creating a bar chart for average ride distance by user type
ggplot(ride_summary, aes(x = member_casual, y = avg_ride_distance, fill = member_casual)) +
  geom_col() +
  labs(title = "Average Ride Distance by User Type",
       x = "User Type",
       y = "Average Ride Distance (Km)")

- Both types of users travel about the same Average Distance.
- One possible explanation is that Annual Members take rides of similar length throughout the week, whereas Casual Riders ride mostly on weekends and with a higher Ride Length.
Creating a Bar chart for Number of rides by month for each user type
ggplot(ride_month_summary, aes(x = start_month, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Month", y = "Number of Rides", title = "Number of Rides by User Type for different Months")

- Both types of users took the most rides during May-October.
- Both types of users took the fewest rides in January, February and December. This might be due to the cold weather in the winter season.
Creating a Bar chart for average ride length by month for each user type
ggplot(ride_month_summary, aes(x = start_month, y = avg_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Month", y = "Average Ride Length (Minutes)", title = "Average Ride Length by User Type for different Months")

- Average Ride Length stays below 15 Minutes throughout the year for Annual Members.
- Average Ride Length is between 12.5 and 25 Minutes throughout the year for Casual Riders.
- For Casual Riders, Average Ride Length is highest in January, March and May, and decreases from July to December.
- In January and February, Average Ride Length is higher but the Number of Rides is the lowest compared to other months.
Creating a Bar chart for average ride distance by month for each user type
ggplot(ride_month_summary, aes(x = start_month, y = avg_ride_distance, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Month", y = "Average Ride Distance (Km)", title = "Average Ride Distance by User Type for different Months")

- Average Ride Distance is high during March-September, and low in January, February and December for Casual Riders.
- Average Ride Distance is high during May-September and November, and is low in January, February and December for Annual Members.
Creating a Bar chart for Number of rides by day for each user type
ggplot(ride_day_summary, aes(x = start_day, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Day", y = "Number of Rides", title = "Number of Rides by User Type for different Days")

- Annual Members took the most rides on weekdays and the fewest on weekends.
- Casual Riders took the most rides on weekends and the fewest on weekdays.
Creating a Bar chart for average ride length by day for each user type
ggplot(ride_day_summary, aes(x = start_day, y = avg_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Day", y = "Average Ride Length (Minutes)", title = "Average Ride Length by User Type for different Days")

- For Casual Riders, Average Ride Length follows the same pattern as the Number of Rides: both are higher on weekends.
- For Annual Members, Average Ride Length is about the same throughout the week (< 15 Minutes).
Creating a Bar chart for average ride distance by day for each user type
ggplot(ride_day_summary, aes(x = start_day, y = avg_ride_distance, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Day", y = "Average Ride Distance (Km)", title = "Average Ride Distance by User Type for different Days")

- Average Ride Distance is approximately the same and consistent throughout the week for both types of users.
Creating a Histogram for Number of rides by hour for each user type
df_cleaned %>%
  ggplot(aes(start_hour, fill = member_casual)) +
  labs(x = "Hour of the Day", y = "Number of Rides", title = "Number of Rides by Hour") +
  geom_bar()

- The Number of Rides for both types of users increased from 5am to 5pm and decreased from 5pm to 5am the next day.
- The Number of Rides was highest between 2pm and 7pm for both types of users.
Creating Histograms for Number of rides by hour and each day of the week by each user type

- There is a large difference between weekdays and weekends.
- On weekdays, volume rises sharply between 5am and 10am and again between 3pm and 7pm.
- It can be hypothesized that Annual Members use the bikes as part of a daily routine, such as going to work (the same behaviour on every weekday) and returning from work (5pm-7pm).
- Weekends are completely different for Annual Members and Casual Riders: on Friday, Saturday and Sunday there is a huge peak in volume for Casual Riders, so it can be hypothesized that Casual Riders mostly use bike share for leisure activities on weekends.
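A chart of rides by hour for each day of the week can be built by adding a facet per day to the hourly bar chart. The sketch below is only one possible way to produce such a chart and is not necessarily how the original figure was made. It uses randomly generated stand-in data (`demo_rides`) with just two days, so its shape carries no meaning.

```r
library(ggplot2)

# Made-up rides standing in for df_cleaned
set.seed(1)
demo_rides <- data.frame(
  start_hour = sample(0:23, 200, replace = TRUE),
  start_day = factor(sample(c("Mon", "Sat"), 200, replace = TRUE)),
  member_casual = sample(c("Annual Member", "Casual Rider"), 200, replace = TRUE)
)

hourly_by_day <- ggplot(demo_rides, aes(x = start_hour, fill = member_casual)) +
  geom_bar(position = "dodge") +
  facet_wrap(~ start_day) +  # one panel per day of the week
  labs(x = "Hour of the Day", y = "Number of Rides",
       title = "Number of Rides by Hour for each Day of the Week")

hourly_by_day
```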
Creating a bar chart for number of rides as per the ride length bucket for each user type
ggplot(ride_bucket_summary, aes(x = ride_length_bucket, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Ride Length Bucket", y = "Number of Rides", title = "Number of Rides by Ride Length Bucket for each User Type")

- The Number of Rides is highest for rides lasting < 10 Minutes and lowest for rides lasting > 90 Minutes.
- Both types of users usually use bikes for shorter durations.
Creating a bar chart for average ride length as per the ride length bucket for each user type
ggplot(ride_bucket_summary, aes(x = ride_length_bucket, y = avg_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Ride Length Bucket", y = "Average Ride Length (Minutes)", title = "Average Ride Length by Ride Length Bucket for each User Type")

- Average Ride Length is almost the same for both types of users in each ride bucket, except for rides > 90 Minutes.
Creating a bar chart for average ride distance as per the ride length bucket for each user type
ggplot(ride_bucket_summary, aes(x = ride_length_bucket, y = avg_ride_distance, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Ride Length Bucket", y = "Average Ride Distance (Km)", title = "Average Ride Distance by Ride Length Bucket for each User Type")

- Average Ride Distance is higher for the rides that lasted 30-45 Minutes and 45-60 Minutes.
Creating a bar chart for number of rides by type of ride for each user type
ggplot(df_cleaned, aes(x = rideable_type, fill = member_casual)) +
  geom_bar() +
  labs(x = "Bike Type", y = "Number of Rides", title = "Number of Rides by Bike Type for each User Type")

- Classic Bikes are used mostly by Annual Members whereas Casual Riders prefer Electric Bikes more.
- A few Docked Bikes are used only by Casual Riders.
Creating a bar chart for most number of rides from top 5 start stations

- Streeter Dr & Grand Ave is the most used Start Station.
Creating a bar chart for most number of rides from top 5 end stations

- Streeter Dr & Grand Ave is the most used End Station.
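A top 5 station chart can be built from a per-station ride count such as ride_start_stations or ride_end_stations. The sketch below is one possible approach and is not necessarily how the original figures were made. It uses made-up counts in a stand-in tibble (`demo_stations`), picks the five largest with slice_max(), and draws horizontal bars.

```r
library(dplyr)
library(ggplot2)

# Made-up counts standing in for ride_start_stations
demo_stations <- tibble(
  start_station_name = paste("Station", LETTERS[1:8]),
  total_rides = c(120, 45, 300, 80, 15, 210, 60, 95)
)

# Five stations with the most rides
top_5_start <- demo_stations %>%
  slice_max(total_rides, n = 5)

top_5_plot <- ggplot(top_5_start,
                     aes(x = reorder(start_station_name, total_rides),
                         y = total_rides)) +
  geom_col() +
  coord_flip() +  # horizontal bars keep long station names readable
  labs(x = "Start Station", y = "Number of Rides",
       title = "Top 5 Start Stations by Number of Rides")

top_5_plot
```

The same steps would apply to end stations by swapping in end_station_name.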
Creating and plotting maps using the coordinates data
Adding a new data frame containing only the most popular routes (more than 200 rides)
coordinates_df <- df_cleaned %>%
  filter(start_lng != end_lng & start_lat != end_lat) %>%
  group_by(start_lng, start_lat, end_lng, end_lat, member_casual, rideable_type) %>%
  summarise(total_rides = n(), .groups = "drop") %>%
  filter(total_rides > 200)
coordinates_df
Creating two different data frames depending on user type
casual_riders <- coordinates_df %>% filter(member_casual == "Casual Rider")
member_riders <- coordinates_df %>% filter(member_casual == "Annual Member")
Setting up ggmap and map of Chicago (bbox, stamen map)
chicago <- c(left = -87.700424, bottom = 41.790769, right = -87.554855, top = 41.990119)
chicago_map <- get_stamenmap(bbox = chicago, zoom = 12, maptype = "terrain")
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
Map of Casual Riders
ggmap(chicago_map, darken = c(0.1, "white")) +
  geom_point(casual_riders, mapping = aes(x = start_lng, y = start_lat, color = rideable_type), size = 2) +
  coord_fixed(0.8) +
  labs(title = "Most used routes by Casual riders", x = NULL, y = NULL) +
  theme(legend.position = "none")
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Warning: Removed 8 rows containing missing values (`geom_point()`).

- Casual Riders are mostly located near the coast; they are probably using bikes for leisure, tourism or sightseeing.
Map of Annual Members
ggmap(chicago_map, darken = c(0.1, "white")) +
  geom_point(member_riders, mapping = aes(x = start_lng, y = start_lat, color = rideable_type), size = 2) +
  coord_fixed(0.8) +
  labs(title = "Most used routes by Annual Members", x = NULL, y = NULL) +
  theme(legend.position = "none")
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Warning: Removed 50 rows containing missing values (`geom_point()`).

- Annual Members use bikes all over the city, both in the main city area and outside the centre. It can be hypothesized that they usually travel for work.
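The maps above plot only the start point of each popular route. If the routes themselves are of interest, one option is to draw a line from each start coordinate to its end coordinate with geom_segment(). The sketch below shows this on made-up coordinates (`demo_routes`) without a map background. The `linewidth` aesthetic assumes ggplot2 3.4.0 or later, and the aspect ratio passed to coord_fixed() is an arbitrary choice.

```r
library(ggplot2)

# Made-up routes standing in for coordinates_df
demo_routes <- data.frame(
  start_lng = c(-87.61, -87.63, -87.65),
  start_lat = c(41.89, 41.88, 41.93),
  end_lng = c(-87.62, -87.64, -87.63),
  end_lat = c(41.91, 41.90, 41.92),
  total_rides = c(450, 260, 310)
)

route_plot <- ggplot(demo_routes) +
  # One line per route, thicker for routes with more rides
  geom_segment(aes(x = start_lng, y = start_lat,
                   xend = end_lng, yend = end_lat,
                   linewidth = total_rides),
               alpha = 0.6) +
  coord_fixed(1.3) +
  labs(title = "Most used routes (illustrative data)", x = NULL, y = NULL)

route_plot
```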
Main insights and conclusions
- Annual Members account for the larger proportion of the Total Rides (almost 60%).
- In all months, there are more rides by Annual Members than by Casual Riders.
- For Casual Riders, the largest volume of rides is on weekends.
- Both types of users have a larger volume of rides from the afternoon until the evening.
- It is possible that Annual Members use bikes for commuting to work. This is supported by their continued bike usage in the colder months, when there is a significant drop in the number of rides by Casual Riders.
- Average Ride Distance is higher for the rides that lasted 30-45 Minutes and 45-60 Minutes.
- The Number of Rides is highest for rides lasting < 10 Minutes and lowest for rides lasting > 90 Minutes.
- Both types of users usually use bikes for shorter durations.
Difference in Bike usage pattern between Annual Members and Casual Riders
- Annual Members account for more rides than Casual Riders, except on Saturday and Sunday, when Casual Riders account for the most rides.
- Casual Riders have a higher Average Ride Length than Annual Members. The Average Ride Length of Annual Members is mostly constant, with a slight increase on weekends.
- Annual Members prefer Classic Bikes, whereas Casual Riders mostly prefer Electric Bikes.
- Annual Members have a more fixed use of bikes for routine activities, whereas Casual Riders' usage is different, mostly for activities during the weekends.
- Casual Riders spend time near the coastal area, whereas Annual Members are scattered throughout the city.